Bishkek
KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP
Metinov, Adilet, Kudakeeva, Gulida M., Kabaeva, Gulnara D.
Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language's morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.
Human-Annotated NER Dataset for the Kyrgyz Language
Turatali, Timur, Alekseev, Anton, Jumalieva, Gulira, Kabaeva, Gulnara, Nikolenko, Sergey
We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the 24.KG news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. We describe our annotation scheme, discuss the challenges encountered in the annotation process, and present descriptive statistics. We also evaluate several named entity recognition models, including traditional sequence labeling approaches based on conditional random fields and state-of-the-art multilingual transformer-based models fine-tuned on our dataset. While all models show difficulties with rare entity categories, models such as the multilingual RoBERTa variant pretrained on a large corpus across many languages achieve a promising balance between precision and recall. These findings emphasize both the challenges and opportunities of using multilingual pretrained models for processing languages with limited resources. Although the multilingual RoBERTa model performed best, other multilingual models yielded comparable results. This suggests that future work exploring more granular annotation schemes may offer deeper insights for evaluating Kyrgyz language processing pipelines.
- South America > Argentina (0.14)
- Asia > Kyrgyzstan > Chüy Region > Bishkek (0.05)
- Asia > Russia (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)
Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models
Simbeck, Katharina, Mahran, Mariam
Despite growing research on bias in large language models (LLMs), most work has focused on gender and race, with little attention to religious identity. This paper explores how religion is internally represented in LLMs and how it intersects with concepts of violence and geography. Using mechanistic interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we analyze latent feature activations across five models. We measure overlap between religion- and violence-related prompts and probe semantic patterns in activation contexts. While all five religions show comparable internal cohesion, Islam is more frequently linked to features associated with violent language. In contrast, geographic associations largely reflect real-world religious demographics, revealing how models embed both factual distributions and cultural stereotypes. These findings highlight the value of structural analysis in auditing not just outputs but also internal representations that shape model behavior.
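The overlap measurement mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which overlap statistic was used, so the Jaccard index here is an assumption, and the feature IDs and group names (`religion_A`, `religion_B`) are hypothetical placeholders rather than real SAE features or Neuronpedia output.

```python
def jaccard(features_a, features_b):
    """Jaccard index of two collections of feature IDs (0.0 if both are empty)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical feature IDs activated by each prompt group (illustrative only).
religion_features = {
    "religion_A": {1, 2, 3, 4},
    "religion_B": {3, 4, 5, 6},
}
violence_features = {4, 5, 6, 7}

# Overlap of each religion's activated features with the violence-related set.
overlaps = {
    name: jaccard(features, violence_features)
    for name, features in religion_features.items()
}
```

A higher value for one group than another would indicate that more of its activated features are shared with the violence-related prompts, which is the kind of comparison the abstract describes.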
- North America > United States > New York > New York County > New York City (0.28)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Palestine > Gaza Strip > Gaza Governorate > Gaza (0.14)
Syntactic Transfer to Kyrgyz Using the Treebank Translation Method
Alekseev, Anton, Tillabaeva, Alina, Kabaeva, Gulnara Dzh., Nikolenko, Sergey I.
The Kyrgyz language, as a low-resource language, requires significant effort to create high-quality syntactic corpora. This study proposes an approach to simplify the development process of a syntactic corpus for Kyrgyz. We present a tool for transferring syntactic annotations from Turkish to Kyrgyz based on a treebank translation method. The effectiveness of the proposed tool was evaluated using the TueCL treebank. The results demonstrate that this approach achieves higher syntactic annotation accuracy compared to a monolingual model trained on the Kyrgyz KTMU treebank. Additionally, the study introduces a method for assessing the complexity of manual annotation for the resulting syntactic trees, contributing to further optimization of the annotation process.
- Asia > Russia (0.05)
- Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.04)
- Asia > Kyrgyzstan > Chüy Region > Bishkek (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings
Alekseev, Anton, Kabaeva, Gulnara
One of the key tasks in modern applied computational linguistics is constructing word vector representations (word embeddings), which are widely used to address natural language processing tasks such as sentiment analysis, information extraction, and more. To choose an appropriate method for generating these word embeddings, quality assessment techniques are often necessary. A standard approach involves calculating distances between vectors for words with expert-assessed 'similarity'. This work introduces the first 'silver standard' dataset for such tasks in the Kyrgyz language, alongside training corresponding models and validating the dataset's suitability through quality evaluation metrics.
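As a minimal sketch of the standard approach described above, the following compares cosine similarities between word vectors with human similarity scores using Spearman rank correlation. The choice of cosine similarity and Spearman correlation is an assumption (both are common for this kind of evaluation, but the abstract does not name them), and the words, vectors, and scores are toy placeholders, not data from HJ-Ky-0.1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_embeddings(embeddings, scored_pairs):
    """Correlate model similarities with human scores, skipping unknown words."""
    model_scores, human_scores = [], []
    for word_1, word_2, human_score in scored_pairs:
        if word_1 in embeddings and word_2 in embeddings:
            model_scores.append(
                cosine_similarity(embeddings[word_1], embeddings[word_2])
            )
            human_scores.append(human_score)
    return spearman(model_scores, human_scores)


# Toy embeddings and human scores (placeholders for illustration only).
toy_embeddings = {
    "w_a": [1.0, 0.0],
    "w_b": [0.9, 0.1],
    "w_c": [0.0, 1.0],
    "w_d": [0.5, 0.5],
}
toy_pairs = [("w_a", "w_b", 9.0), ("w_a", "w_d", 5.0), ("w_a", "w_c", 1.0)]
score = evaluate_embeddings(toy_embeddings, toy_pairs)
```

A correlation close to 1 means the embedding model orders word pairs the same way the human assessors did; pairs containing out-of-vocabulary words are simply skipped in this sketch, which is one design choice among several.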
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Germany > Saxony > Leipzig (0.09)
- Asia > Russia (0.05)
KyrgyzNLP: Challenges, Progress, and Future
Alekseev, Anton, Turatali, Timur
Large language models (LLMs) have excelled in numerous benchmarks, advancing AI applications in both linguistic and non-linguistic tasks. However, this has primarily benefited well-resourced languages, leaving less-resourced ones (LRLs) at a disadvantage. In this paper, we highlight the current state of the NLP field in one specific LRL: Kyrgyz (kyrgyz tili). Human evaluation, including annotated datasets created by native speakers, remains an irreplaceable component of reliable NLP performance, especially for LRLs where automatic evaluations can fall short. In recent assessments of the resources for Turkic languages, Kyrgyz is labeled with the status 'Scraping By', marking it as a severely under-resourced language spoken by millions. This is concerning given the growing importance of the language, not only in Kyrgyzstan but also among diaspora communities where it holds no official status. We review prior efforts in the field, noting that many of the publicly available resources have only recently been developed, with few exceptions beyond dictionaries (the processed data used for the analysis is presented at https://kyrgyznlp.github.io/). While recent papers have made some headway, much more remains to be done. Despite interest and support from both business and government sectors in the Kyrgyz Republic, the situation for Kyrgyz language resources remains challenging. We stress the importance of community-driven efforts to build these resources, ensuring the sustainability of future advancement. We then share our view of the most pressing challenges in Kyrgyz NLP. Finally, we propose a roadmap for future development in terms of research topics and language resources.
- Asia > Russia (0.14)
- Europe > Germany > Saxony > Leipzig (0.05)
- Asia > Kyrgyzstan > Chüy Region > Bishkek (0.04)
- Research Report (1.00)
- Overview > Growing Problem (0.34)
- Government (1.00)
- Media > News (0.46)
Assessing Large Language Models for Online Extremism Research: Identification, Explanation, and New Knowledge
Dong, Beidi, Lee, Jin R., Zhu, Ziwei, Srinivasan, Balassubramanian
The United States has experienced a significant increase in violent extremism, prompting the need for automated tools to detect and limit the spread of extremist ideology online. This study evaluates the performance of Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-Trained Transformers (GPT) in detecting and classifying online domestic extremist posts. We collected social media posts containing "far-right" and "far-left" ideological keywords and manually labeled them as extremist or non-extremist. Extremist posts were further classified into one or more of five contributing elements of extremism based on a working definitional framework. The BERT model's performance was evaluated based on training data size and knowledge transfer between categories. We also compared the performance of GPT 3.5 and GPT 4 models using different prompts: naïve, layperson-definition, role-playing, and professional-definition. Results showed that the best performing GPT models outperformed the best performing BERT models, with more detailed prompts generally yielding better results. However, overly complex prompts may impair performance. Different versions of GPT have unique sensitivities to what they consider extremist. GPT 3.5 performed better at classifying far-left extremist posts, while GPT 4 performed better at classifying far-right extremist posts. Large language models, represented by GPT models, hold significant potential for online extremism classification tasks, surpassing traditional BERT models in a zero-shot setting. Future research should explore human-computer interactions in optimizing GPT models for extremist detection and classification tasks to develop more efficient (e.g., quicker, less effort) and effective (e.g., fewer errors or mistakes) methods for identifying extremist content.
- Europe > Germany (0.14)
- North America > United States > Virginia > Fairfax County > Fairfax (0.04)
- North America > United States > Washington > King County > Bellevue (0.04)
- Media (1.00)
- Law > Civil Rights & Constitutional Law (1.00)
- Law Enforcement & Public Safety > Terrorism (1.00)
Knowledge Graph Representation for Political Information Sources
Osmonova, Tinatin, Tikhonov, Alexey, Yamshchikov, Ivan P.
With the rise of computational social science, many scholars utilize data analysis and natural language processing tools to analyze social media, news articles, and other accessible data sources for examining political and social discourse. In particular, the study of the emergence of echo-chambers due to the dissemination of specific information has become a topic of interest in mixed methods research areas. In this paper, we analyze data collected from two news portals, Breitbart News (BN) and New York Times (NYT), to prove the hypothesis that the formation of echo-chambers can be partially explained at the level of individual information consumption rather than by the collective topology of individuals' social networks. Our research findings are presented through knowledge graphs, utilizing a dataset spanning 11.5 years gathered from BN and NYT media portals. We demonstrate that the application of knowledge representation techniques to the aforementioned news streams shows, contrary to common assumptions, relative "internal" neutrality of both sources and a polarizing attitude towards a small fraction of entities. Additionally, we argue that such characteristics in information sources lead to fundamental disparities in audience worldviews, potentially acting as a catalyst for the formation of echo-chambers.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > New York (0.04)
- Europe > Germany > Bavaria > Lower Franconia > Würzburg (0.04)
- Government (1.00)
- Media > News (0.94)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.72)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.47)
HIVA: Holographic Intellectual Voice Assistant
Isaev, Ruslan, Gumerov, Radmir, Esenalieva, Gulzada, Mekuria, Remudin Reshid, Doszhanov, Ermek
Holographic Intellectual Voice Assistant (HIVA) aims to facilitate human-computer interaction using audiovisual effects and a 3D avatar. HIVA provides complete information about the university and handles requests of various kinds: admission, study issues, fees, departments, university structure and history, canteen, human resources, library, student life and events, information about the country and the city, etc. There are other ways of obtaining the information listed above: the university's official website and other supporting apps, the HEI's (Higher Education Institution's) official social media, directly asking the HEI staff, and other channels. However, HIVA provides the unique experience of "face-to-face" interaction with an animated 3D mascot, helping to give a sense of 'real-life' communication. The system includes many sub-modules and connects a family of applications such as mobile applications, a Telegram chatbot, suggestion categorization, and entertainment services. The voice assistant uses Russian-language NLP models and tools, which are pipelined for the best user experience.
- Asia > Kyrgyzstan > Chüy Region > Bishkek (0.05)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > Italy (0.04)
Applying Machine Learning Analysis for Software Quality Test
Khan, Al, Mekuria, Remudin Reshid, Isaev, Ruslan
One of the biggest expenses in software development is maintenance. Therefore, it is critical to comprehend what triggers maintenance and whether it can be predicted. Numerous studies have demonstrated that specific methods of assessing the complexity of created programs may produce useful prediction models to ascertain the possibility of maintenance due to software failures. Such prediction is routinely performed prior to release, and setting up the models frequently calls for specific object-oriented software measurements. Software developers do not always have access to these measurements. In this paper, machine learning is applied to the available data to calculate the cumulative software failure levels. Forecasting a software product's residual defectiveness with machine learning can be explored as a solution to the challenge of predicting residual flaws. Software metrics and defect data were extracted from the static source code repository. Static code is used to create software metrics, and reported bugs in the repository are used to gather defect information. By using a correlation method, metrics that had no connection to the defect data were removed. This makes it possible to analyze all the data without pausing the programming process. The primary issue with large, sophisticated software is that it is impossible to control everything manually, and the cost of an error can be very high. Developers may miss errors during testing as a consequence, which will raise maintenance costs. Finding a method to accurately forecast software defects is the overall objective.
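The correlation-based filtering step can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify which correlation measure or cutoff was used, so Pearson correlation and the `threshold=0.3` default are assumptions, and the metric names and values are hypothetical toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def select_metrics(metrics, defects, threshold=0.3):
    """Keep metrics whose absolute correlation with defect counts meets the threshold.

    `threshold` is an assumed cutoff chosen for illustration.
    """
    kept = {}
    for name, values in metrics.items():
        r = pearson(values, defects)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


# Hypothetical per-module measurements and defect counts (toy data).
defect_counts = [0, 1, 2, 3, 4]
module_metrics = {
    "lines_of_code": [10, 20, 30, 40, 50],  # rises with defects
    "comment_ratio": [3, 1, 4, 1, 3],       # unrelated to defects
    "constant_flag": [1, 1, 1, 1, 1],       # carries no information
}
selected = select_metrics(module_metrics, defect_counts)
```

With this toy data only `lines_of_code` survives the filter; the remaining metrics would then serve as inputs to whatever prediction model is trained on the defect data.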